## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Most wines received a quality rating of 5 or 6. The median citric.acid is 0.26 and the median pH is 3.31. At least 75% of the red wines have a density below 1.
This is a histogram of quality. The distribution is roughly bell-shaped, and most of the red wines are rated 5 or 6.
This is a histogram of alcohol. The distribution is slightly right-skewed, with a mean of 10.42.
This is a stacked histogram of alcohol by quality. It shows that most wines rated 5 or 6 have an alcohol level between 9 and 10.5.
The first graph is a plain histogram of residual sugar, which is strongly right-skewed. The second graph puts the x-axis on a log10 scale, which makes the histogram look closer to a normal distribution.
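The effect of the log10 transform can also be checked numerically. The sketch below uses only the first ten residual.sugar values printed by str() above, so the numbers are illustrative and not the full-data result. The names `sugar` and `skewness()` are introduced here for illustration, and base graphics are used as an assumption to keep the example dependency-free.

```r
# First ten residual.sugar values, copied from the str() output above.
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Simple moment-based sample skewness (hypothetical helper, not from the report).
# Positive values indicate a longer right tail.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(sugar)         # skewness on the original scale
skew_log <- skewness(log10(sugar))  # skewness after the log10 transform

# Draw both histograms side by side.
op <- par(mfrow = c(1, 2))
hist(sugar, main = "residual.sugar", xlab = "residual.sugar")
hist(log10(sugar), main = "log10(residual.sugar)", xlab = "log10(residual.sugar)")
par(op)

print(c(raw = skew_raw, log10 = skew_log))
```

On this small sample the transformed values are still right-skewed, but less so than the raw values, which is the same direction of change described for the full histogram.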
This is a histogram of citric acid. The distribution is not normal; it has three peaks, around 0.1, 0.22, and 0.5.
This is a histogram of pH. The distribution is close to normal and centered around 3.3.
The plot of density shows an approximately normal distribution. Most of the wines have a density between 0.99 and 1.0.
Most of the variables are numeric, in other words quantitative variables that may or may not affect the quality of the red wine. Quality ranges from 3 to 8. The median and the 75% quantile of quality are both 6, which means many red wines are rated 6.
I’m interested in assessing what factors may affect the rating for the quality of red wine in the dataset.
I think alcohol, pH, and residual sugar may help, since their distributions look roughly normal, which is similar to the shape of the quality distribution.
No, I haven't. I am not familiar enough with acidity, sugar, pH, or alcohol to create meaningful new variables.
Looking at the scatterplots of fixed.acidity vs quality and volatile.acidity vs quality, the second shows a slight negative linear correlation, whereas the first does not. I am not sure how similar these two kinds of acidity are, but I expected the scatterplots to look similar. I will do further correlation analysis in the next section.
Meanwhile, looking at the summary table, I found that the maximum residual.sugar is 15.5 whereas the minimum is 0.9; I am not sure whether this is a recording error. The same goes for free.sulfur.dioxide, where the minimum is 1 and the maximum is 72; I am not sure whether this is within the normal range.
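One rough way to judge how unusual these maxima are is Tukey's 1.5 × IQR rule, which is a technique swapped in here and not something the original analysis used. The sketch below applies it to the quartiles printed in the summary table above; the data frame `q` and its column names are made up for illustration. A flagged value is only a candidate outlier, not proof of a recording error.

```r
# Quartiles and maxima copied from the summary() output above.
q <- data.frame(
  variable = c("residual.sugar", "free.sulfur.dioxide"),
  q1  = c(1.9, 7),
  q3  = c(2.6, 21),
  max = c(15.5, 72)
)

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as potential outliers.
q$iqr         <- q$q3 - q$q1
q$upper_fence <- q$q3 + 1.5 * q$iqr
q$max_flagged <- q$max > q$upper_fence

print(q)
```

Both maxima lie well above their upper fences (3.65 and 42), which matches the long right tails seen in the histograms.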
After checking the correlation between all variables, the following pairs have the highest correlation coefficients.
fixed acidity and volatile acidity 0.672; fixed acidity and alcohol 0.668; free sulfur dioxide and total sulfur dioxide 0.668; pH and citric.acid -0.542; volatile.acidity and citric.acid -0.552; density and alcohol -0.496
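A ranking like this can be produced from a correlation matrix. The following is a minimal sketch in base R, assuming Pearson correlation; it uses only the first ten rows of four columns as printed by str() above, so its coefficients will not match the full-data values listed here. `rw10`, `cm`, and `pairs_tbl` are names introduced for the example.

```r
# First ten rows of four columns, copied from the str() output above.
rw10 <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

cm <- cor(rw10)                       # Pearson correlation matrix
cm[upper.tri(cm, diag = TRUE)] <- NA  # keep each pair once, drop the diagonal

# Flatten to a long table and sort by absolute correlation, strongest first.
pairs_tbl <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
names(pairs_tbl) <- c("var1", "var2", "r")
pairs_tbl <- pairs_tbl[!is.na(pairs_tbl$r), ]
pairs_tbl <- pairs_tbl[order(-abs(pairs_tbl$r)), ]

print(pairs_tbl)
```

Sorting by absolute value keeps strong negative pairs, such as pH and citric.acid in the list above, next to strong positive ones.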
Since I have converted quality into an ordered factor, the plots of quality against the other variables are box plots.
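The conversion and a matching box plot might look like the sketch below. It uses the first ten quality and alcohol values from the str() output above and base graphics; the exact call used in the original analysis is not shown in this report, so `factor(..., ordered = TRUE)` is an assumption.

```r
# First ten values of quality and alcohol, copied from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Treat the rating as an ordered categorical variable instead of a number.
quality_f <- factor(quality, ordered = TRUE)

# One box per rating; a numeric variable against a factor gives a box plot.
boxplot(alcohol ~ quality_f, xlab = "quality", ylab = "alcohol")

print(levels(quality_f))
```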
From observation, these pairs show a linear tendency: quality and volatile.acidity, negative; quality and citric.acid, positive; quality and sulphates, positive; quality and alcohol, positive.
The above graphs are scatterplots of quality versus other factors. For free sulfur dioxide and total sulfur dioxide, it looks like the higher the quality, the lower the sulfur dioxide. For alcohol, higher-quality red wines tend to have a higher alcohol level. For sulphates, there is a trend of higher-quality red wines having slightly higher sulphates.
After running correlation tests, alcohol is the only variable with a notable linear correlation with quality, with a coefficient of about 0.47. None of the other variables has a strong enough correlation with quality.
I grouped the dataset by quality and calculated the mean and median alcohol level for each group. The data points show the median alcohol level for each group, and they do show a linear trend between quality and alcohol.
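The grouping step can be sketched in base R with `tapply()`. This is an illustration on the first ten rows from the str() output above, not the original code, and the names `rw10` and `alcohol_by_quality` are assumptions.

```r
# First ten rows of alcohol and quality, copied from the str() output above.
rw10 <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Mean, median, and count of alcohol within each quality rating.
# tapply() returns the groups in increasing order of the rating.
alcohol_by_quality <- data.frame(
  quality        = sort(unique(rw10$quality)),
  mean_alcohol   = as.vector(tapply(rw10$alcohol, rw10$quality, mean)),
  median_alcohol = as.vector(tapply(rw10$alcohol, rw10$quality, median)),
  n              = as.vector(tapply(rw10$alcohol, rw10$quality, length))
)

print(alcohol_by_quality)

# Plot the group medians against the rating.
plot(alcohol_by_quality$quality, alcohol_by_quality$median_alcohol,
     type = "b", xlab = "quality", ylab = "median alcohol")
```

With only ten rows the groups are tiny (one wine is rated 6), so the trend is only meaningful on the full data set.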
##
## Pearson's product-moment correlation
##
## data: density and alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
From the scatterplot and correlation test of density and alcohol above, we see that alcohol and density have a negative linear relationship.
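As a consistency check, the t statistic and the confidence interval in the output above can be recomputed from the reported correlation and degrees of freedom alone. The sketch uses the standard formulas for Pearson's correlation test, t = r · sqrt(df / (1 − r²)) and a Fisher z interval; the variable names are introduced here for illustration.

```r
r  <- -0.4961798  # sample correlation from the output above
df <- 1597        # degrees of freedom from the output above (n - 2)
n  <- df + 2      # number of observations

# t statistic for H0: true correlation equals 0.
t_stat <- r * sqrt(df / (1 - r^2))

# 95% confidence interval via the Fisher z-transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 3))
print(round(ci, 4))
```

The recomputed values agree with the printed t = -22.838 and the interval from about -0.532 to -0.458, so the interval excludes zero by a wide margin.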
From the plots above, we can see that pH and citric acid have a negative relationship.
Alcohol and quality have a positive linear correlation, but quality does not seem to have a strong linear correlation with any other variable.
pH and citric acid have a negative linear correlation. Density and alcohol also have a negative linear correlation.
In the next section, multivariate plots, I used ggpairs to assess each variable against every other variable, and found that fixed acidity and volatile acidity have the strongest relationship, with a correlation coefficient of 0.672.
This is a histogram of alcohol by quality rating. Ratings 5 and 6 have the highest peaks, which means they have the highest counts of all the groups.
This is a histogram of residual sugar by quality rating. Almost all the groups show a roughly normal but slightly right-skewed distribution.
This is a histogram of pH by quality rating. All the groups are close to normally distributed.
This is a histogram of density by quality rating. All the groups are close to normally distributed. Meanwhile, almost all of the wines have a density between 0.99 and 1.00, based on the summary of density above.
This is a histogram of citric acid by quality rating. Citric acid does not show a clear normal distribution; each group has several peaks. I think this may indicate that there is no correlation between citric acid and quality.
##
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = rw)
## m2: lm(formula = as.numeric(quality) ~ alcohol + fixed.acidity, data = rw)
## m3: lm(formula = as.numeric(quality) ~ alcohol + fixed.acidity +
## volatile.acidity, data = rw)
## m4: lm(formula = as.numeric(quality) ~ alcohol + fixed.acidity +
## volatile.acidity + citric.acid, data = rw)
## m5: lm(formula = as.numeric(quality) ~ alcohol + fixed.acidity +
## volatile.acidity + citric.acid + residual.sugar, data = rw)
## m6: lm(formula = as.numeric(quality) ~ alcohol + fixed.acidity +
## volatile.acidity + citric.acid + residual.sugar + chlorides,
## data = rw)
## m7: lm(formula = as.numeric(quality) ~ alcohol + fixed.acidity +
## volatile.acidity + citric.acid + residual.sugar + chlorides +
## free.sulfur.dioxide, data = rw)
## m8: lm(formula = as.numeric(quality) ~ alcohol + fixed.acidity +
## volatile.acidity + citric.acid + residual.sugar + chlorides +
## free.sulfur.dioxide + total.sulfur.dioxide, data = rw)
## m9: lm(formula = as.numeric(quality) ~ alcohol + fixed.acidity +
## volatile.acidity + citric.acid + residual.sugar + chlorides +
## free.sulfur.dioxide + total.sulfur.dioxide + density, data = rw)
## m10: lm(formula = as.numeric(quality) ~ alcohol + fixed.acidity +
## volatile.acidity + citric.acid + residual.sugar + chlorides +
## free.sulfur.dioxide + total.sulfur.dioxide + density + pH,
## data = rw)
## m11: lm(formula = as.numeric(quality) ~ alcohol + fixed.acidity +
## volatile.acidity + citric.acid + residual.sugar + chlorides +
## free.sulfur.dioxide + total.sulfur.dioxide + density + pH +
## sulphates, data = rw)
##
## ===============================================================================================================================================
##                                m1         m2         m3         m4         m5         m6         m7         m8         m9        m10        m11
## -----------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept)                -0.125  -0.794***    0.674**    0.622**    0.626**    0.663**    0.691**   0.890***     13.660    -21.976     19.965
##                           (0.175)    (0.196)    (0.218)    (0.219)    (0.219)    (0.229)    (0.237)    (0.243)   (17.684)   (20.943)   (21.195)
## alcohol                  0.361***   0.368***   0.321***   0.325***   0.325***   0.322***   0.322***   0.306***   0.296***   0.337***   0.276***
##                           (0.017)    (0.016)    (0.016)    (0.016)    (0.016)    (0.017)    (0.017)    (0.017)    (0.022)    (0.026)    (0.026)
## fixed.acidity                       0.071***   0.036***   0.056***   0.056***   0.055***   0.054***    0.043**    0.052**     -0.007      0.025
##                                      (0.010)    (0.010)    (0.013)    (0.013)    (0.013)    (0.014)    (0.014)    (0.018)    (0.026)    (0.026)
## volatile.acidity                              -1.286***  -1.420***  -1.416***  -1.403***  -1.406***  -1.320***  -1.308***  -1.287***  -1.084***
##                                                 (0.099)    (0.115)    (0.115)    (0.117)    (0.118)    (0.120)    (0.121)    (0.121)    (0.121)
## citric.acid                                                -0.314*    -0.308*     -0.283     -0.282     -0.133     -0.133     -0.151     -0.183
##                                                            (0.137)    (0.138)    (0.145)    (0.145)    (0.150)    (0.150)    (0.150)    (0.147)
## residual.sugar                                                         -0.004     -0.004     -0.003      0.002      0.007     -0.008      0.016
##                                                                       (0.012)    (0.012)    (0.012)    (0.012)    (0.014)    (0.015)    (0.015)
## chlorides                                                                         -0.215     -0.217     -0.332     -0.321     -0.696  -1.874***
##                                                                                  (0.383)    (0.383)    (0.383)    (0.383)    (0.400)    (0.419)
## free.sulfur.dioxide                                                                          -0.001     0.004*     0.004*     0.005*     0.004*
##                                                                                             (0.002)    (0.002)    (0.002)    (0.002)    (0.002)
## total.sulfur.dioxide                                                                                 -0.003***  -0.003***  -0.003***  -0.003***
##                                                                                                        (0.001)    (0.001)    (0.001)    (0.001)
## density                                                                                                           -12.797     25.114    -17.881
##                                                                                                                  (17.720)   (21.370)   (21.633)
## pH                                                                                                                          -0.611**    -0.414*
##                                                                                                                              (0.194)    (0.192)
## sulphates                                                                                                                              0.916***
##                                                                                                                                         (0.114)
## -----------------------------------------------------------------------------------------------------------------------------------------------
## R-squared                     0.2        0.3        0.3        0.3        0.3        0.3        0.3        0.3        0.3        0.3        0.4
## adj. R-squared                0.2        0.2        0.3        0.3        0.3        0.3        0.3        0.3        0.3        0.3        0.4
## sigma                         0.7        0.7        0.7        0.7        0.7        0.7        0.7        0.7        0.7        0.7        0.6
## F                           468.3      266.5      253.1      191.6      153.2      127.7      109.4       98.0       87.2       79.9       81.3
## p                             0.0        0.0        0.0        0.0        0.0        0.0        0.0        0.0        0.0        0.0        0.0
## Log-likelihood            -1721.1    -1696.2    -1615.4    -1612.7    -1612.7    -1612.5    -1612.4    -1606.1    -1605.9    -1600.9    -1569.1
## Deviance                    805.9      781.2      706.1      703.8      703.7      703.6      703.5      698.0      697.7      693.4      666.4
## AIC                        3448.1     3400.5     3240.8     3237.5     3239.4     3241.1     3242.8     3232.2     3233.7     3225.7     3164.3
## BIC                        3464.2     3422.0     3267.6     3269.7     3277.0     3284.1     3291.2     3286.0     3292.9     3290.2     3234.2
## N                            1599       1599       1599       1599       1599       1599       1599       1599       1599       1599       1599
## ===============================================================================================================================================
This is a linear model for predicting the quality of red wine. I used alcohol as the first explanatory variable and added one variable at a time to see how the fit of the model changes.
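One detail worth checking in the calls above is `as.numeric(quality)`. If quality was already an ordered factor when the models were fit, as the bivariate section describes, `as.numeric()` returns the integer level codes and not the ratings themselves. The sketch below shows the difference on the first ten ratings, assuming the factor has the levels 3 to 8 implied by the summary; `q_fac`, `codes`, and `scores` are illustrative names.

```r
# First ten quality ratings, copied from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Assumed factor encoding: ordered levels 3 to 8, the observed range of ratings.
q_fac <- factor(quality, levels = 3:8, ordered = TRUE)

codes  <- as.numeric(q_fac)                # level codes: 3 -> 1, ..., 8 -> 6
scores <- as.numeric(as.character(q_fac))  # the original ratings

print(rbind(codes = codes, scores = scores))
```

Under that assumption the response would be the rating minus 2, which leaves the slopes unchanged but shifts every intercept down by 2. Converting through `as.character()` keeps predictions on the original 3 to 8 scale.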
Yes. Looking at the box plots of quality against the other variables, I found that the following pairs have a somewhat linear relationship, even though the correlation tests do not show strong correlations: quality and volatile.acidity, negative; quality and citric.acid, positive; quality and sulphates, positive; quality and alcohol, positive.
Some correlations are self-explanatory from the names, for example fixed acidity and volatile acidity, free sulfur dioxide and total sulfur dioxide, and volatile.acidity and citric.acid. Although I am not an expert in chemistry, these variables appear to be fairly strongly correlated with each other.
Yes. I created a linear model with quality as the outcome and alcohol as the explanatory variable, then added the other variables to the model one at a time.
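Adding one predictor at a time can be sketched with `lm()` and `update()`. The example below fits only the first two steps on the first ten rows from the str() output above, so its coefficients are not comparable to the table; the object names and the use of `update()` are assumptions and not necessarily how the original models were built.

```r
# First ten rows of three columns, copied from the str() output above.
rw10 <- data.frame(
  quality       = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
)

# Start with a single predictor, then extend the formula one term at a time.
m1 <- lm(quality ~ alcohol, data = rw10)
m2 <- update(m1, . ~ . + fixed.acidity)

# Collect fit statistics so the steps can be compared.
fit <- data.frame(
  model     = c("m1", "m2"),
  r.squared = c(summary(m1)$r.squared, summary(m2)$r.squared),
  AIC       = c(AIC(m1), AIC(m2))
)

print(fit)
```

R-squared never decreases when a term is added, so AIC or BIC, which penalize extra parameters, are the more informative columns when comparing nested models like m1 to m11.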
A limitation of the model is that the quality ratings come from several people and could be very subjective. A strength is that it suggests some factors, such as alcohol and sulphates, may affect the taste of red wine and thus how people perceive its quality.
The graph shows the distribution of quality for the red wines in the dataset. Many of them are rated 5 or 6, and the shape of the distribution is roughly normal.
From the summary of quality above, we know that the mean is 5.636, the median is 6, and the 3rd quartile is 6. This tells us that at least 75% of the red wines are rated 6 or below, and most of them are rated 5 or 6.
The above plot shows the relationship between density and alcohol by quality. Most of the wines have a density between 0.99 and 1.00, which is not a wide range. Meanwhile, within each quality group there is a clear negative linear correlation between alcohol and density: the lower the density, the higher the alcohol. We can also see differences between the quality ratings. Higher-quality wines have a higher alcohol level than lower-quality wines. The line for quality 8 is almost parallel to the line for quality 7.
From the boxplot of alcohol by quality, we can see that the higher the quality, the higher the alcohol level tends to be, so there is a positive linear relationship between quality and alcohol. Quality 5 has more outliers than the other groups, and its mean is slightly lower than that of the other groups. The other groups have very few outliers, so nearly all of their wines fall within the whiskers of the box plot. Quality 8 has the highest mean of all the groups.
Through the analysis of the data, I found that alcohol is the only variable with a notable correlation with red wine quality. It seems that professional wine tasters tend to prefer wines with a higher alcohol level.
As the quality rating is relatively subjective, the dataset could include variables describing the wine tasters and cover as many different tasters as possible. Meanwhile, wine quality could be related to other variables that the dataset does not include, such as grape type, year of production, and origin.
In addition, further analysis could assess the correlation between other pairs of variables. New variables may need to be created from the log, square root, or product of the current variables, but I am not familiar enough with the chemistry to make any assumptions.